
Allow concurrent broker restarts within same broker rack #1001

Merged: 11 commits into banzaicloud:master on Aug 4, 2023

Conversation

@ctrlaltluc (Contributor) commented Jun 22, 2023

Description

Cluster restarts (during rolling upgrades) are time-consuming for big clusters.

Given that clients might use the Kafka broker.rack config to spread replicas across N AZs, we can speed up cluster restarts by allowing concurrent broker restarts within the same AZ. Restricting concurrency to a single AZ ensures that at most 1/N of the replicas of any topic-partition are offline at a time, while the other (N-1)/N replicas remain available.
For example, in a common 3-AZ setup, at most 1/3 of the replicas would be offline while the other 2/3 stay in sync.

Key changes

  • add a configuration parameter, RollingUpgradeConfig.ConcurrentBrokerRestartsPerRackCount, which defaults to 1
  • allow concurrent broker restarts only if there are no failures or restarts in other AZs, and only while the number of failures or restarts in the same AZ stays within both RollingUpgradeConfig.ConcurrentBrokerRestartsPerRackCount and RollingUpgradeConfig.FailureThreshold (see the sketch after this list)
  • keep the semantics of RollingUpgradeConfig.FailureThreshold unchanged: multiple failures are still allowed (as before) even if they are not in the same AZ
  • if broker.rack is not properly configured for all brokers, treat each broker as being in a separate AZ, i.e. err on the side of caution and restart at most 1 broker at a time
  • trigger concurrent restarts only if Cruise Control has the RackAwareDistributionGoal ready and not recently violated (either as fixable or unfixable)
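A minimal sketch of the per-rack eligibility rule described in the list above. This is not the PR's actual code: the function name, its inputs, and the exact threshold comparison are illustrative assumptions.

```go
// Package restartcheck sketches the concurrent-restart eligibility rule.
package restartcheck

// canRestartConcurrently reports whether one more broker in targetRack may be
// restarted, given the racks of brokers that are currently failing or restarting.
// concurrentPerRack stands in for ConcurrentBrokerRestartsPerRackCount and
// failureThreshold for FailureThreshold.
func canRestartConcurrently(impactedRacks []string, targetRack string, concurrentPerRack, failureThreshold int) bool {
	// Any failure or restart in a different rack blocks further concurrency.
	for _, rack := range impactedRacks {
		if rack != targetRack {
			return false
		}
	}
	// Within the same rack, the number of already impacted brokers must stay
	// below both limits before one more restart is allowed.
	impacted := len(impactedRacks)
	return impacted < concurrentPerRack && impacted < failureThreshold
}
```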

Notes

  • to enable multiple concurrent restarts, both RollingUpgradeConfig.ConcurrentBrokerRestartsPerRackCount and RollingUpgradeConfig.FailureThreshold must be set to a value above 1 (see the sketch below)
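For illustration, a minimal sketch of the two settings using local stand-in types named after the fields in this description; the real types live in the koperator v1beta1 API and the exact field names there should be verified against the released CRD.

```go
package main

import "fmt"

// RollingUpgradeConfig is a local stand-in mirroring the two fields named in
// this PR description; it is not the koperator API type itself.
type RollingUpgradeConfig struct {
	FailureThreshold                     int
	ConcurrentBrokerRestartsPerRackCount int
}

func main() {
	// Both values must be above 1 for same-rack concurrent restarts to kick in.
	cfg := RollingUpgradeConfig{
		FailureThreshold:                     2,
		ConcurrentBrokerRestartsPerRackCount: 2,
	}
	fmt.Printf("%+v\n", cfg)
}
```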

Type of Change

  • New Feature

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

@panyuenlau (Member) left a comment

Thx for the contribution! Took a quick look at the implementation and it looks good. Will come back for the tests when I get some more time tmr

@bartam1 (Contributor) commented Jun 27, 2023

Hello @ctrlaltluc!
Thank you for the contribution. We are reviewing it.
I can see a problematic case when rack awareness is only enabled after some time. In that case there can be topics without rack awareness, and restarting multiple brokers in the same rack that holds all of those topics' partitions can be a problem. (com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal is turned off or violated.)

What if, instead of restarting based on racks, we implement a check that maps broker-topic-partitions on the Kafka cluster and verifies that the broker in question does not hold any partition that is missing from (or out of sync on) every other healthy broker? If there is no such partition, we can restart the broker. It would be even better if, before the broker restart, a new leader election were triggered for the partitions hosted on the broker we want to restart, so producing and consuming would be completely seamless (no need to wait for the controller to elect a new leader). We could introduce a new field specifying how many in-sync replicas should remain during a rolling upgrade. (That way, if another broker goes down unintentionally during the rolling upgrade, the cluster can still stay healthy.) After the rolling upgrade we should check that everything is in sync, and once it is, execute the kafka-preferred-replica-election command so that every leader becomes the preferred one.

The https://github.com/linkedin/cruise-control/wiki/REST-APIs#demote-a-list-of-brokers-from-the-kafka-cluster endpoint can also be used before we restart a broker. (It can also be used before remove_broker; in that case the remove_broker operation could finish sooner, because there would be no leaders on the removable broker, which is problematic under heavy load.)

Executing a demote_broker operation during the rolling upgrade would mean: once the broker is demoted, restart it, then move on to the next broker. If we do this, it is possible that the remaining healthy brokers get too much load.
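This alternative was not implemented in this PR (see the discussion below), but a rough sketch of the per-partition safety check described above could look like the following; PartitionInfo and the inputs are hypothetical stand-ins for metadata fetched from the Kafka cluster.

```go
package restartcheck

// PartitionInfo is a hypothetical stand-in for partition metadata.
type PartitionInfo struct {
	Topic     string
	Partition int32
	ISR       []int32 // broker IDs currently in sync for this partition
}

// safeToRestart reports whether brokerID can be restarted: every partition it
// hosts must keep at least one in-sync replica on another healthy broker.
func safeToRestart(brokerID int32, hosted []PartitionInfo, healthy map[int32]bool) bool {
	for _, p := range hosted {
		hasOtherISR := false
		for _, replica := range p.ISR {
			if replica != brokerID && healthy[replica] {
				hasOtherISR = true
				break
			}
		}
		if !hasOtherISR {
			return false
		}
	}
	return true
}
```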

func getBrokerAzMap(cluster *v1beta1.KafkaCluster) map[int32]string {
	brokerAzMap := make(map[int32]string)
	for _, broker := range cluster.Spec.Brokers {
		brokerRack := getBrokerRack(broker.ReadOnlyConfig)
Contributor:

We should take care of the case when the broker config comes from the brokerConfigGroup.
Here we should extend and use this function:

func (b *Broker) GetBrokerConfig(kafkaClusterSpec KafkaClusterSpec) (*BrokerConfig, error) {

Contributor Author:

Can the broker rack be specified via broker config?

I assumed the source of truth for the broker rack is the read-only config, since in broker config groups you can define it more freely, not necessarily with rack semantics, and broker.rack needs to be set in the read-only configs for it to work.

Please lmk if I am mistaken.

@ctrlaltluc (Contributor, Author) commented Jul 17, 2023

(Quoting @bartam1's Jun 27, 2023 comment in full; see above.)

Hi Marton @bartam1! I would like to reach a conclusion here before addressing the other PR comments.
Please note that, as I mentioned in the CRD field description, enabling concurrent restarts requires the CC com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal to be enabled, so that we have a uniform replica distribution across AZs. The reason is not only to ensure more than 1 replica per AZ, but also to balance the load across the other AZs while some replicas in a given AZ are down.

Regarding the suggestion to keep a map of topic-partitions to brokers, I believe this would not be easy to implement while ensuring consistency. Topic-partitions might change between queries, and we would have to keep that map consistent with the actual cluster metadata. I would keep this option as a last resort.

Regarding the suggestion to use demote, I agree it would be safe, but it would also make the restart much slower, and it might not be necessary. I first want to understand whether it is. Our foremost motivation for this feature was to speed up cluster restarts during rolling upgrades, as 60-broker clusters currently take a long time to restart (3h+).

Regarding the problem you raised, I'll try to summarize it to make sure I understand what you say is not covered and needs addressing. Please correct me if I am wrong.

Problematic use case:

  • Kafka cluster with multiple brokers, broker.rack not configured
  • the client sets a proper broker.rack read-only config for all brokers
  • Kafka starts enforcing rack distribution
  • before this finishes, the client enables concurrent broker restarts
    Expected behavior: brokers are not yet restarted concurrently
    Actual behavior: brokers are restarted concurrently and might hit all of a topic's replicas, as the rack distribution is not yet finalized by Kafka and CC

Is this what you mean is not covered?

If this is accurate, then I would argue that:

  • the prerequisite for concurrent restarts (CC com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal being enabled) was not met
  • if that goal is enabled and CC starts a replica rebalance, the rolling restart will be prevented anyway for as long as the CC task is running

Wdyt?

@bartam1 (Contributor) commented Jul 20, 2023

if the above is enabled and CC starts a replica rebalance, the rolling restart will be prevented anyway as long as the CC task is running

Hello @ctrlaltluc
Yes, the problematic use case is the one you wrote.
I'm not sure regarding this:

  • if the above is enabled and CC starts a replica rebalance, the rolling restart will be prevented anyway as long as the CC task is running
    I think the rolling restart will not be prevented in that case, and we also cannot be sure that the RackAwareDistributionGoal self-healing operation is started at all.

This is an edge case.

If we want to be sure, we can check two things before a multiple-broker restart:

  1. Is the RackAwareDistributionGoal set in the CC configuration?
  2. Is there a RackAwareDistributionGoal violation (CC state/AnomalyDetectorState/recentGoalViolations)?
    What do you think?

Regarding demoting: it happens instantly when the target broker is in sync. I tested this with a continuous producer and consumer, and there was no communication interruption during demotion. By contrast, when there is an active producer to the leader and a rolling upgrade happens without demotion, there will be an interruption until the new leader is elected. I'm not sure whether the mapping and demotion I described are worth the effort to implement; I don't have much experience with production Kafka clusters. If it is worth it, it can be done. I saw it as a possible improvement and was curious about your opinion.

@ctrlaltluc (Contributor, Author) commented:

I was curious about your opinion.

Thanks @bartam1! Your point makes perfect sense.

I will add these additional checks before deciding to restart more than 1 broker (a rough sketch follows the list):

  • RackAwareDistributionGoal is set in the CC configuration
  • RackAwareDistributionGoal is not violated
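A rough sketch of how these two checks might be queried from Cruise Control's state endpoint; the endpoint path and the JSON field names (beyond the state/AnomalyDetectorState/recentGoalViolations structure mentioned above) are assumptions and should be verified against the CC version in use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// goalViolation field names are assumptions based on the fixable/unfixable
// distinction discussed in this PR.
type goalViolation struct {
	FixableViolatedGoals   []string `json:"fixableViolatedGoals"`
	UnfixableViolatedGoals []string `json:"unfixableViolatedGoals"`
}

type anomalyDetectorState struct {
	RecentGoalViolations []goalViolation `json:"recentGoalViolations"`
}

type ccState struct {
	AnomalyDetectorState anomalyDetectorState `json:"AnomalyDetectorState"`
}

// rackGoalRecentlyViolated reports whether RackAwareDistributionGoal appears
// among the recent goal violations returned by Cruise Control's state endpoint.
func rackGoalRecentlyViolated(ccURL string) (bool, error) {
	resp, err := http.Get(ccURL + "/kafkacruisecontrol/state?substates=ANOMALY_DETECTOR&json=true")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var state ccState
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		return false, err
	}
	for _, v := range state.AnomalyDetectorState.RecentGoalViolations {
		for _, goal := range append(v.FixableViolatedGoals, v.UnfixableViolatedGoals...) {
			if strings.Contains(goal, "RackAwareDistributionGoal") {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	// Hypothetical in-cluster Cruise Control address used only for illustration.
	violated, err := rackGoalRecentlyViolated("http://cruise-control-svc:8090")
	fmt.Println(violated, err)
}
```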

I will have a look into demoting too, as it would make restarts smoother for clients (no interruptions); that makes sense. I will come back on this with either code changes or a comment.

@ctrlaltluc ctrlaltluc force-pushed the banzaicloud/parallel-restarts branch from 3398dc9 to 43a982a on July 21, 2023 14:30
@bartam1 (Contributor) commented Jul 22, 2023

(Quoting @ctrlaltluc's previous comment; see above.)

Thank you for doing this!

@ctrlaltluc ctrlaltluc force-pushed the banzaicloud/parallel-restarts branch from 43a982a to 69eee7b on July 26, 2023 10:19
@ctrlaltluc (Contributor, Author) left a comment

@bartam1 @panyuenlau I have addressed all comments on the PR, including the check for RackAwareDistributionGoal. Please have another look, thanks in advance for your time!

@bartam1 I have not implemented demote brokers, as it is not necessary. By default, Kafka brokers have controlled.shutdown.enable set to true. This means that any graceful Kafka shutdown (i.e. anything except SIGKILL) will ensure that leaders are moved.
LE: I also tested demote myself, and the behavior is the same as for normal broker restarts (leaders are moved). However, demote has some downsides: 1/ it takes a long time (~10 min per broker for brokers with ~4500 replicas), as demote moves the demoted broker to the end of every in-sync replica set, and this takes time even though there is no data movement; 2/ we would later need to add the demoted broker back, as it would otherwise remain in the recently demoted brokers list, which complicates the code without added benefit.

Please note that the tests will still be failing, until #1002 is merged and a new API module release is done.

@ctrlaltluc ctrlaltluc marked this pull request as ready for review July 26, 2023 10:27
@ctrlaltluc ctrlaltluc requested a review from a team as a code owner July 26, 2023 10:27
@ctrlaltluc ctrlaltluc requested review from panyuenlau and bartam1 July 26, 2023 10:27
panyuenlau previously approved these changes Jul 26, 2023
@panyuenlau (Member) left a comment

Apologies - I was zoned out doing my own implementation. Thx for the great work! I only left a couple of quick comments (more like nits).

	return brokerAzMap
}

func getBrokerRack(readOnlyConfig string) string {
Member:

Question: it looks like the latest implementation doesn't require this func; can we remove it since it's not being used?

Contributor Author:

Oh, good catch! Forgot about it. Will address in another commit, thanks for paying attention!

	}
}

func getBrokerAzMap(cluster *v1beta1.KafkaCluster) map[int32]string {
Member:

Suggestion: it'd be appreciated if we could add some quick unit tests for funcs like this one, but since the caller handleRollingUpgrade is tested, I am not going to ask for any more changes.

Contributor Author:

Let's do this properly. I added unit tests for it in cae78f5, please have a look. Thanks!

P.S. I saw you were doing heavy lifting with kraft, so no worries and thanks for the review. :)

@panyuenlau (Member) commented:

FYI - The new API tag https://github.com/banzaicloud/koperator/tree/api/v0.28.7 should be available soon

@ctrlaltluc ctrlaltluc force-pushed the banzaicloud/parallel-restarts branch from cae78f5 to 997f46e on July 28, 2023 07:45
@ctrlaltluc ctrlaltluc requested a review from panyuenlau July 28, 2023 08:20
@ctrlaltluc (Contributor, Author) commented:

@panyuenlau @bartam1 the PR is now final and ready for review. CI fails because of the flaky nodeports test (I tried once to retrigger it with an empty commit, then gave up); otherwise all tests pass. Thanks for your patience!

@bartam1 (Contributor) commented Jul 30, 2023

I think #1023 can have an effect on this PR.
Maybe we should not do multiple broker restarts when the brokers are controllers.
When multiple brokers (controllers) are restarting, there is a bigger chance that there are not enough controllers to form a quorum.

@ctrlaltluc (Contributor, Author) commented:

I think #1023 can have an effect on this PR. Maybe we should not do multiple broker restarts when the brokers are controllers. When multiple brokers (controllers) are restarting, there is a bigger chance that there are not enough controllers to form a quorum.

@bartam1 I had a look at #1023 and indeed, if KRaft is enabled and the controllers are not spread across AZs (racks), there are edge cases where multiple restarts within the same rack/AZ might be an issue.

I see 2 options to move forward:

  1. Merge this PR as it is now; then, once the Initial KRaft support #1023 PR is merged, we will have spec.kraft available and can add another check: if concurrent restarts > 1, more than 1 pod is terminating or pending, and any controller in the same AZ/rack is among the terminating or pending pods, do not restart another pod (ensuring at most 1 KRaft controller is restarted; see the sketch after this list)
  2. Wait until Initial KRaft support #1023 is merged, then add the check above in this PR.
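A rough sketch of the extra guard in option 1; the inputs are hypothetical, since the controller-role information would only come from spec.kraft once #1023 is merged.

```go
package restartcheck

// blockConcurrentRestart mirrors the condition in option 1: skip a further
// concurrent restart when concurrent restarts are enabled, more than one pod
// in the rack is already terminating or pending, and at least one of them is
// a KRaft controller. isController is a hypothetical lookup derived from
// spec.kraft once #1023 is merged.
func blockConcurrentRestart(concurrentRestartsPerRack int, impactedBrokersInRack []int32, isController map[int32]bool) bool {
	if concurrentRestartsPerRack <= 1 || len(impactedBrokersInRack) <= 1 {
		return false
	}
	for _, id := range impactedBrokersInRack {
		if isController[id] {
			return true
		}
	}
	return false
}
```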

@bartam1 @panyuenlau lmk your thoughts.

@bartam1 (Contributor) commented Aug 3, 2023

(Quoting @ctrlaltluc's previous comment; see above.)

I agree. We should merge this PR. We can add modifications later if needed.

@ctrlaltluc ctrlaltluc requested a review from bartam1 August 3, 2023 14:29
@bartam1 (Contributor) left a comment

LGTM!
Thank you for implementing it with consideration of the edge cases.

@panyuenlau (Member) left a comment

Thx for considering the potential impact on KRaft support, @bartam1 @ctrlaltluc;
we will keep this new feature in mind when we get there during the KRaft implementation.

@ctrlaltluc (Contributor, Author) commented:

Thanks for your thorough reviews as well! Looking forward to getting this merged.

@bartam1 bartam1 merged commit 4329189 into banzaicloud:master Aug 4, 2023
@ctrlaltluc ctrlaltluc deleted the banzaicloud/parallel-restarts branch August 7, 2023 09:12